Abstract: BotometerLite is advertised as a lightweight bot detector that improves scalability by focusing on only user profile information; furthermore, BotometerLite claims that using fewer features only entails a small compromise in individual accuracy. We test the validity of this claim by comparing Botometer with BotometerLite bot likelihood scores for 75,000 users across 5 data sets. We randomly sampled 15,000 users from each of the following data sets: Coronavirus, 2016 election, News outlets, Charlottesville, and the Twitter API. BotometerLite scores varied drastically from Botometer scores.

Introduction

Botometer is one of the most popular bot detection tools used in social science (Rauchfleisch and Kaiser 2020). However, due to Botometer API rate limits, Beskow et al. (2018) recommend a tiered framework for bot detection and suggest that models focusing only on user profile information can be used at scale for general estimates of bot penetration.

Yuan, Schuchard, and Crooks (2019) used DeBot for large-scale bot annotations when examining tweets related to the 2015 California Disneyland measles outbreak, whereas Broniatowski, Hilyard, and Dredze (2016) used Botometer for small-scale bot annotations.

Dunn et al. (2020) annotated bots based on Botometer scores of 0.5 or greater when assessing the limited role of bots in spreading vaccine-critical information. Botometer’s FAQ page explicitly states, “It’s tempting to set some arbitrary threshold score and consider everything above that number a bot and everything below a human, but we do not recommend this approach. Binary classification of accounts using two classes is problematic because few accounts are completely automated.” Instead, Botometer recommends setting a threshold on the CAP score. Dunn et al. (2020) acknowledge the imprecision inherent in bot detection and state that the conclusions of their study are robust to differences related to imprecision of the bot proportion estimate.

Botometer was initially launched in May 2014, and BotometerLite was released in September 2020. BotometerLite improves scalability by focusing on only user profile information; furthermore, BotometerLite claims that using fewer features only entails a small compromise in individual accuracy (Yang et al. 2020). The training and performance evaluation of BotometerLite is described in “Scalable and Generalizable Social Bot Detection through Data Selection” (Yang et al. 2020).

Rauchfleisch and Kaiser (2020) found that Botometer scores are imprecise at estimating bots, especially in a different language, and are prone to variance over time, which leads to a high number of human users being classified as bots and vice versa.

Contribution

Many researchers annotate bots based on Botometer score thresholds, in line with the precedent established in previous literature (add citations). Understanding how BotometerLite performs in comparison to Botometer is critical to prevent people from thinking BotometerLite can be used as a scalable substitute for Botometer.

Research Questions

In this study, we seek to answer the following questions:

  • How similar are Botometer and BotometerLite ratings?
  • Is BotometerLite effective at identifying specific types of bots? In other words, are BotometerLite scores strongly correlated with any of the Botometer category scores (e.g., spammers, fake followers, etc.)?
  • Can BotometerLite be used as a triage tool to identify a subset of accounts that require more extensive evaluation via Botometer?
  • Do some topics have more assessed bots than others? How do the topic-specific bot category scores compare to a random sample of Twitter users?

Bot Type Scores

The Botometer FAQ describes bot type scores for the following categories:

  • Astroturf: manually labeled political bots and accounts involved in follow trains that systematically delete content
  • Fake follower: bots purchased to increase follower counts
  • Financial: bots that post using cashtags
  • Self declared: bots from botwiki.org
  • Spammer: accounts labeled as spambots from several datasets
  • Other: miscellaneous other bots obtained from manual annotation, user feedback, etc.

The Complete Automation Probability (CAP) is defined by Botometer as the probability, according to its models, that an account with this score or greater is a bot.

The Botometer website uses the CAP to express the percentage of accounts with a bot score above a given account’s score that are labeled as humans. Think of this as the chances that you would wrongly classify a human as a bot if you used this account’s score as a threshold. You would want this probability to be pretty small, say less than 5%. (For the statisticians, this is a p-value.)
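To make the "share of accounts above this score that are labeled as humans" reading concrete, here is a minimal sketch that computes that share empirically from a labeled sample. It is only an illustration of the definition, not Botometer's actual model; the function name `human_share_at_or_above` and the scores and labels are hypothetical.

```python
def human_share_at_or_above(scores, is_bot, threshold):
    """Share of accounts scoring >= threshold that are labeled human."""
    flagged = [bot for s, bot in zip(scores, is_bot) if s >= threshold]
    if not flagged:
        raise ValueError("no accounts at or above the threshold")
    humans = sum(1 for bot in flagged if not bot)
    return humans / len(flagged)


# Hypothetical labeled accounts (made-up scores and labels, illustration only).
scores = [0.10, 0.35, 0.60, 0.80, 0.85, 0.90, 0.95, 0.97]
is_bot = [False, False, False, True, False, True, True, True]

# Chance of wrongly calling a human a bot if 0.80 were used as the threshold.
share = human_share_at_or_above(scores, is_bot, threshold=0.80)
print(f"human share at or above 0.80: {share:.2f}")
print(f"bot share at or above 0.80:   {1 - share:.2f}")
```

In this toy sample, five accounts score 0.80 or higher and one of them is labeled human, so the human share is 0.20 and the complementary bot share is 0.80.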

Methodology

  1. Randomly sample 20,000 tweet IDs from each of the selected collections in GWU’s Tweet Sets Library.
  2. Rehydrate the tweet IDs via Twarc to obtain user IDs.
  3. Randomly sample 17,000 unique user IDs from the 20,000 tweets (some users are duplicated in the initial 20K sample).
  4. Query the Botometer API for 17,000 users (not all users will return a Botometer score).
  5. Query the BotometerLite API for 17,000 users (not all users will return a BotometerLite score).
  6. Randomly select 15,000 users who have both Botometer and BotometerLite scores.
  7. Compare distribution of CAP scores across data sets.
  8. Compare proportion of bot category scores > k across data sets (e.g., how many accounts had astroturf scores greater than k in each data set? Did one data set have significantly more than others?).
  9. Calculate correlation between Botometer and BotometerLite scores.
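The sampling logic in steps 3 and 6 can be sketched as follows. This is a minimal illustration using only the Python standard library; it does not call Twarc or either API, and the helper names, toy user IDs, and dictionary-based score lookups are all hypothetical stand-ins for the real data.

```python
import random


def sample_unique_users(tweet_user_ids, n, rng):
    """Step 3: drop duplicate users, then sample up to n without replacement."""
    unique_ids = sorted(set(tweet_user_ids))
    return rng.sample(unique_ids, min(n, len(unique_ids)))


def sample_scored_by_both(botometer_scores, lite_scores, n, rng):
    """Step 6: keep users scored by both tools, then sample up to n of them."""
    both = sorted(set(botometer_scores) & set(lite_scores))
    return rng.sample(both, min(n, len(both)))


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical user IDs from rehydrated tweets; some users appear repeatedly.
tweet_user_ids = [rng.randrange(1, 60) for _ in range(100)]
users = sample_unique_users(tweet_user_ids, n=40, rng=rng)

# Hypothetical score lookups; a few users are dropped to mimic missing scores.
botometer_scores = {u: rng.random() for u in users if u % 7 != 0}
lite_scores = {u: rng.random() for u in users if u % 5 != 0}

final_users = sample_scored_by_both(botometer_scores, lite_scores, n=20, rng=rng)
print(len(users), len(final_users))
```

Sampling the final set only from users that have both scores keeps the two score distributions aligned on the same accounts, which the correlation in step 9 requires.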

Results

The following preliminary results explore the similarity between Botometer and BotometerLite scores for 8,685 users sampled from a 5G conspiracy theory tweet set. (Note: 10,000 users were submitted to Botometer and 8,685 accounts had scores. I’m not sure what caused some to come back without a score, but think they may have been suspended or deactivated accounts.)

I am currently collecting bot scores for users sampled from the GWU Tweet Sets library to be used in the final paper.

Raw Score Correlations

BotometerLite is most similar to the Botometer fake follower and spammer scores with \(R^2\) values of 0.394 and 0.334, respectively. Hence, if Botometer scores are accurate, BotometerLite may be somewhat effective at identifying some fake followers and spammers.

The Pearson correlation matrix (\(R^2\) values are the squares of the entries of this matrix) also shows the scores are weakly correlated.
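The relationship between a Pearson coefficient and its \(R^2\) value can be sketched in a few lines. The function below is a plain standard-library implementation of the Pearson formula; the two score lists are hypothetical and are not the study data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five accounts (illustration only).
lite_scores = [0.10, 0.40, 0.50, 0.90, 0.30]
fake_follower_scores = [0.20, 0.30, 0.70, 0.80, 0.10]

r = pearson_r(lite_scores, fake_follower_scores)
r_squared = r ** 2  # share of variance explained by a linear fit
print(f"r = {r:.3f}, R^2 = {r_squared:.3f}")
```

Read in the other direction, the reported \(R^2\) values of 0.394 and 0.334 correspond to Pearson coefficients with magnitudes of about 0.63 and 0.58, respectively (the square roots of those values).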


Estimated Bot Proportion

2,352 (27%) of the sampled users have a Complete Automation Probability (CAP) of 0.75 or greater. CAP is the probability that an account with this score or greater is a bot. Therefore, if we model each account's bot status \(x_i\) as an independent Bernoulli random variable with success probability \(p_i\) equal to its CAP score, the number of bots follows a Poisson binomial distribution, and the expected number of bots is given by:

\(E[\sum x_{i}]=\sum p_{i}\)


Hence, we should expect 1,905 (21.9%) of the 8,685 accounts to be bots if we only consider CAP scores of 0.75 or greater.

If we don’t set a CAP threshold and sum all CAP scores, the expected number of bots is 4,531, or about 52% of the accounts. In other words, the Botometer CAP scores estimate that over half of the users discussing 5G on Twitter are bots.
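The expected-count calculation above amounts to summing CAP scores, optionally after applying a threshold. A minimal sketch follows, assuming each CAP score is treated as that account's probability of being a bot; the function name `expected_bots` and the CAP values are hypothetical.

```python
def expected_bots(cap_scores, threshold=0.0):
    """Return (accounts kept, expected bots) for CAP scores >= threshold."""
    kept = [p for p in cap_scores if p >= threshold]
    # The expectation of a sum of 0/1 indicators is the sum of their probabilities.
    return len(kept), sum(kept)


# Hypothetical CAP scores for eight accounts (illustration only).
cap_scores = [0.05, 0.20, 0.50, 0.75, 0.80, 0.90, 0.95, 0.30]

n_flagged, expected_flagged = expected_bots(cap_scores, threshold=0.75)
n_all, expected_all = expected_bots(cap_scores)

print(f"CAP >= 0.75: {n_flagged} accounts, {expected_flagged:.2f} expected bots")
print(f"no threshold: {n_all} accounts, {expected_all:.2f} expected bots")
```

Applied to the figures reported above, 1,905 expected bots among the 2,352 accounts with CAP of 0.75 or greater implies a mean CAP of about 0.81 within that group (1,905 / 2,352).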

Conclusion

Future work for course project:

  • Update introduction to include other articles that have critiqued Botometer
  • For EM6574 only, replicate results of Indiana University BotometerLite paper (Train a classifier to predict manually labeled bots and compare with BotometerLite)
  • Post code to GitHub repo

Questions:

  • Should I expand this beyond Botometer and include a comparison of scores generated from other models (e.g., DeBot, BotSlayer, etc.)?
  • Has anyone used DeBot? I requested an API key but did not receive a response.
  • Do I have “the right” number for each of my samples (15,000 users per data set)? Is a random sample appropriate?
  • What statistical tests should we do to provide evidence Botometer and BotometerLite produce different results? t-test for difference of means? F-test for difference of variance? Hotelling test across all bot scores?
  • Should we look at just the scores derived from English-speaking accounts or also include the universal scores?
  • What visualizations should we use? t-SNE separating accounts with CAP > 0.8 from those with CAP < 0.8?

References

Beskow, David, Kathleen M Carley, Halil Bisgin, Ayaz Hyder, Chris Dancy, and Robert Thomson. 2018. “Introducing Bothunter: A Tiered Approach to Detection and Characterizing Automated Activity on Twitter.” In International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation. Springer.

Broniatowski, David A, Karen M Hilyard, and Mark Dredze. 2016. “Effective Vaccine Communication During the Disneyland Measles Outbreak.” Vaccine 34 (28). Elsevier: 3225–8.

Dunn, Adam G, Didi Surian, Jason Dalmazzo, Dana Rezazadegan, Maryke Steffens, Amalie Dyda, Julie Leask, Enrico Coiera, Aditi Dey, and Kenneth D Mandl. 2020. “Limited Role of Bots in Spreading Vaccine-Critical Information Among Active Twitter Users in the United States: 2017–2019.” American Journal of Public Health 110 (S3). American Public Health Association: S319–S325.

Rauchfleisch, Adrian, and Jonas Kaiser. 2020. “The False Positive Problem of Automatic Bot Detection in Social Science Research.” Berkman Klein Center Research Publication, no. 2020-3.

Yang, Kai-Cheng, Onur Varol, Pik-Mai Hui, and Filippo Menczer. 2020. “Scalable and Generalizable Social Bot Detection Through Data Selection.” In Proceedings of the AAAI Conference on Artificial Intelligence, 34:1096–1103. 01.

Yuan, Xiaoyi, Ross J Schuchard, and Andrew T Crooks. 2019. “Examining Emergent Communities and Social Bots Within the Polarized Online Vaccination Debate in Twitter.” Social Media + Society 5 (3). SAGE Publications Sage UK: London, England: 2056305119865465.